Data Simulation

Simulating networks

We adopt a top-down approach to simulate hierarchical networks, considering various simulation parameters such as graph sparsity, noise, and the architecture of the super-level graph(s), including small-world, scale-free, and random graph networks (Watts and Strogatz 1998; Barabási and Bonabeau 2003).

Our simulations focus on basic hierarchies comprising one or two hierarchical layers. Two-layer networks mirror classical community detection on graphs, where our aim is to recover the true community labels from a given graph. Meanwhile, three-layer networks present a more intricate scenario, where the bottom layer of the hierarchy contains two levels of community structure. Here, the top level corresponds to the nodes at the uppermost layer of the hierarchy, and the middle level consists of communities nested within the top-level communities. The objective with these networks is to identify both sets of community partitions.

In each hierarchy, for fully connected networks, we initiate by simulating \(N_{\text{top}}\) top-level nodes, adhering to a directed small-world, random graph, or scale-free network architecture (Watts and Strogatz 1998; Barabási and Bonabeau 2003). In cases where the network is disconnected, we simply simulate \(N_{\text{top}}\) disconnected nodes. For networks with three hierarchical layers, we then generate a subnetwork of \(N_{\text{middle}}\) nodes from each top-layer node, adhering to the network structure utilized at the top level. If the network is fully connected, we apply a probability \(p_\text{between}\) to the nodes from different top-level communities being connected.

The final step in all hierarchies is to generate the nodes in the observed (bottom) layer of the hierarchy. For each top-layer or middle-layer node, we generate a sub-network of \(N_{\text{bottom}}\) nodes under the same sub-network structure as the previous layers, and we apply a probability \(p_\text{between}\) for nodes from different communities to share an edge.

Simulating gene expression

Once we simulate a hierarchical graph, we utilize this hierarchy to generate the node-feature matrix, which depicts the expression of \(N\) genes across \(d\) samples. Here, \(N\) denotes the number of nodes in the observed (bottom) layer of the hierarchy.

We simulate the node-feature matrix using the topological order of the observed level graph. We start by generating the features of nodes that have no parental input. We refer to these nodes as origin nodes. All origin nodes are simulated from a normal distribution with mean \(0\) and standard deviation \(\sigma\). All other nodes are simulated from a normal distribution centered at the mean of their parent nodes and with standard deviation \(\sigma\).

Hierarchical Commuity Detection (HCD) Overview

Our HCD method consists of two modules:

  1. A graph autoencoder based on the architecture proposed by Salehi and Davulcu (2019) which utilizes graph attention layers such as those first introduced by Veličković et al. (2017) (See most recent version of pseudocode for details). In our applications, we incorporate multi-head attention in all encoder and decoder layers to expand model learning capacity. The graph autoencoder module takes a set of node attributes \({\bf X}\in\mathbb{R}^{N \times d}\) and an adjacency matrix \({\bf A}_{ij}\in[0,1]\) defining the relationships between the node as input and learns a low dimensional embedding \({\bf Z} \in \mathbb{R}^{N \times q}\) of the network and attributes. This embedding is then used to reconstruct the node attributes and adjacency matrix under a separate loss function for each.

  2. The second module of HCD takes the embeddings \({\bf Z}\) generated by the autoencoder and applies a multilevel community detection process. This module first applies the function \(f_{top}\) which partitions the data into \(k\) groups. The function \(f_{top}\) can be either (i) the \(SoftKMeans\) layer described by (Falkner 2022) or a neural layer such as

Simple Linear Layer:

\[{\bf P} = f_{top}({\bf Z}) = \text{Softmax}(\theta({\bf ZW} + {\bf b}))\] The above example is a simple fully connected layer where \({\bf W} \in \mathbb{R}^{N \times k}\) is a matrix of parameters, \({\bf b} \in \mathbb{R}^N\) is a bias parameter vector and \(\theta\) is an activation function such as the \(ReLU\) function.

Alternative types of layer can also be used such heuristic KMeans or other GNN-type layers such as GraphSAGE convolution [@].

Using gradient optimized SoftKMeans

\[{\bf P} = f_{top}({\bf Z}) = \text{Softmax}(g({\bf Z})) \] where \(g()\) represents the KMeans algorithm (Falkner 2022)

GraphSAGE Convolutional Layer

\[{\bf p}_v = f_{top}({\bf z}_v) = \text{Softmax}\left(\theta\left( W · CONCAT\big({\bf z}_v,\ AGGREGATE\big({\bf z}_u : u \in N(v) \big) \big) \right)\right)\] where \({\bf p}_v\) is the soft assignment of node \(v\)

Simulation Study

Here we describe the settings for three different sets of simulations. For each set of simulations, we simulated hierarchical gene networks consisting of 5 top level nodes/communities, 15 middle level nodes/communities and 300 nodes/genes at the observed level of the hierarchy. We simulate networks for all three graph types (small world, scale free or random graph) with the nodes at the top level of the hierarchy either all connected or all disconnected. We also simulated nodes under two levels of noise with noise set to either \(\sigma = 0.1\) or \(\sigma = 0.5\). Additionally, we sample the probability for edges within and between node communities at the middle layer and bottom level from the uniform distributions

Each combination of parameters (graph type, connectivity, and noise) is replicated multiple times.

To evaluate the performance of HCD, we conducted a series of simulation studies under various conditions. Below, we describe each simulation condition in detail:

Linear-GT

  • Description: HCD employs linear neural layers for both the top and middle layer partitions.
  • Setup:
    • Top layer output size: 5 (ground truth number of communities).
    • Middle layer output size: 15 (ground truth number of communities).
  • Purpose: To assess performance when prior information about the true community structure is provided.

SoftKmeans-15-5

  • Description: SoftKmeans is used to perform the top partitioning, while the middle layer employs a linear neural network.
  • Setup:
    • Top layer: \(k = 5\) (ground truth number of communities).
    • Middle layer output size: 15 (ground truth number of communities).
  • Purpose: To compare the performance of SoftKmeans with the HCD hierarchical framework.

Linear-64-64

  • Description: Linear neural layers are used for both the top and middle layers without incorporating prior information about community structure.
  • Setup:
    • Top layer output size: 64.
    • Middle layer output size: 64.
  • Purpose: To evaluate HCD’s ability to detect communities in the absence of ground truth information.

Linear-BH-64

  • Description: Linear neural layers are used for both layers, with the top layer leveraging the Bethe Hessian estimate for the number of communities.
  • Setup:
    • Top layer output size: Bethe Hessian estimate of the number of communities based on the input graph.
    • Middle layer output size: 64.
  • Purpose: To examine the effectiveness of incorporating the Bethe Hessian estimate in guiding top-layer partitioning.

GATCONV-BH-64

  • Description: Graph attention layers (GAT) are used for both the top and middle layers.
  • Setup:
    • Top layer output size: Bethe Hessian estimate of the number of communities.
    • Middle layer output size: 64.
  • Purpose: To evaluate the impact of using graph attention (GAT) in hierarchical partitioning.

SAGECONV-BH-64

  • Description: GraphSAGE convolutional layers are employed for both the top and middle layers.
  • Setup:
    • Top layer output size: Bethe Hessian estimate of the number of communities.
    • Middle layer output size: 64.
  • Purpose: To investigate the utility of GraphSAGE in hierarchical clustering.

SAGECONV-BH-64-DW

  • Description: Similar to the SAGECONV-BH-64 setup, but with adjusted clustering loss function hyperparameters.
  • Setup:
    • Top layer output size: Bethe Hessian estimate of the number of communities.
    • Middle layer output size: 64.
    • Additional modification: Clustering component hyperparameters in the loss function are downweighted.
  • Purpose: To determine if downweighting clustering loss hyperparameters improves performance.

Summary

These simulation conditions are designed to rigorously assess the performance of HCD under diverse scenarios. By varying the partitioning methods, output sizes, and use of prior information, we aim to gain insights into HCD’s robustness, adaptability, and effectiveness in uncovering hierarchical community structures in genomic data.

Evaluating performance

We evaluate the performance of our HCD method using three graph-based clustering metrics:

  1. homogeneity evaluates the degree to which each predicted community contains only data points from a single true community, indicating how well the algorithm avoids mixing different groups. Thus, homogeneity tends to be high if resolved communities contain only members of the same true community.

  2. completeness assesses the extent to which all data points that belong to the same true community are correctly assigned to a single predicted community. Thus completeness is always high if all members of the same true communities end up in the same resolved community even if several true communities are allocated together.

  3. NMI Normalized Mutual Information (NMI) is a weighted average of the Completeness and Homogeneity two metrics.

  4. ARI The Adjusted Rand Index (ARI) is a metric that measures the similarity between two different clusterings of the same data, correcting for the chance of random agreement. It ranges from -1 to 1, where 1 indicates perfect agreement between the clusterings, 0 represents random labeling, and negative values indicate less agreement than expected by chance.

In all simulations we compare HCD with two commonly used heuristic methods:

  • Louvain Method: This widely used community detection algorithm optimizes modularity by iteratively reassigning nodes between communities and merging communities. It automatically determines the number of communities and is highly efficient for processing large networks. In all simulations, we apply the Louvain method to the estimated adjacency matrix, which is derived from the correlation matrix of the node features. The resulting community predictions are compared to the ground truth for both the top and middle layers of the simulated hierarchy.

  • Hierarchical Clustering using Ward’s Distance (HC): This agglomerative clustering approach iteratively merges clusters to minimize the increase in within-cluster variance. Ward’s method tends to produce compact, spherical clusters and generates a dendrogram representing the full hierarchical structure of the data. In all simulations, HC is applied to the simulated gene expression data (node feature matrix). The optimal community predictions are determined by identifying the best cutting point on the dendrogram, aligning with the ground truth clusters (five clusters and 15 clusters).

Ways of Estimating K

There are several classical approaches to estimating the number of communities, \(k\), in unsupervised learning. Among these, two commonly adopted data-driven approaches are the Elbow Method and the optimization of a cluster quality metric, such as the Silhouette Score. Additionally, we also consider a graph-based approach known as the Bethe Hessian estimate.

  • Elbow Method:
    The Elbow Method identifies the optimal number of clusters by plotting the clustering cost (e.g., within-cluster sum of squares) against the number of clusters, \(k\). The “elbow” point, where the rate of decrease in the cost slows significantly, is taken as the optimal \(k\). This method assumes that beyond this point, adding more clusters does not improve clustering quality significantly.

  • Silhouette Method:
    The Silhouette Method evaluates the quality of clustering by measuring how similar data points are to their assigned cluster (cohesion) compared to other clusters (separation). The Silhouette score ranges from -1 to 1, where higher values indicate better-defined clusters. The optimal \(k\) is the value that maximizes the average Silhouette score across all data points.

  • Bethe Hessian:
    The Bethe Hessian is a graph-based method for estimating the number of communities in a network. It involves constructing the Bethe Hessian matrix \({\bf B}_\eta = (\eta^2 - 1){\bf I} + {\bf D} - \eta {\bf A}\), where \(A\) is the adjacency matrix, \(D\) is the degree matrix, and \(\eta\) is a regularization parameter. The number of negative eigenvalues of \(H(r)\) provides an estimate of the number of communities. This approach leverages spectral properties of the graph to identify structural groupings. According to [@]

Points to investigate

The Hierarchical Community Detection (HCD) algorithm performs poorly on the top layer of the hierarchy, despite the clear community structure at this level especially in the case of the disconnected networks. This observation warrents further investigation into the model dynamics to determine why the community structure clearly reflected in embeddings is not also captured by the loss function.

  • To help debug the issue noted above, we generated a small world disconnected network with 300 bottom layer nodes, 15 middle layer communities and 5 top layer communities (see Image 1). The plot of the adjacency matrix and heatmap of the data correlations clearly reflects the imposed community structure. This Example will be used throughout the debugging process described below.
Figure 1: Heatmap of the correlations in the data and true adjacency matrix for a randomly generated example small world disconnected network with standard deviation 0.1 and intermediate size (300 nodes). This example was used to debug the model.
Figure 1: Heatmap of the correlations in the data and true adjacency matrix for a randomly generated example small world disconnected network with standard deviation 0.1 and intermediate size (300 nodes). This example was used to debug the model.

Debugging to determine model prediction disparity:

Using the example network shown above, we debugged each stage of training HCD investigating both the loss calculation and visually determining whether the computed probabilities and intermediate steps accurately reflect community structure captured in embeddings produced by the autoencoder GATE.

The clustering loss for the \(\ell^{th}\) hierarchical layer is computed as:

\[{\bf D}_\ell = {\bf X}^T - {\bf M}_\ell {\bf P}_\ell^T \]

\[ L_{C_\ell} = tr\left(\sqrt{\text{diag}({\bf D}_\ell^T{\bf D}_\ell)}\right) \]

Where \({\bf D}_\ell \in \mathbb{R}^{n \times N_\ell}\), \({\bf P}_\ell \in \mathbb{R}^{N \times k_\ell}\), and

\[ {\bf M}_\ell = {\bf X}^T {\bf P}_\ell \cdot \left({\bf 1}_N^T {\bf P}_\ell\right)^{-1} \in \mathbb{R}^{n \times k_\ell} \]

are the centroids for the community predictions in the \(\ell^{th}\) hierarchical layer.

In general, this calculation seems consistent with the results of the model. The issue appears to be that the steps converting the embeddings to logits and eventually probabilities via softmax are getting distorted.

Recall that HCD consists of two modules: - 1. An autoencoder which produces a an embedding \(\bf Z\) - 2. A prediction model which converts the embedding into logits which capture model structure and then converts these logits into pseudo-probabilities vias the softmax activation function. This is the standard deep learning procedure for multiclass classification.

The previous framework for the prediction model is described in detail mathematically below:

Previous Output Layer Framework

The embeddings \({\bf M}\) were computed as

\[ {\bf M} = \text{LeakyReLU}({\bf Z} {\bf W}_{out} + \beta_{out}) \]

where \({\bf Z}\in\mathbb{R}^{N \times q}\) is the input embedding from GATE, \({\bf M}\in\mathbb{R}^{N \times k}\) is the output embedding, and \({\bf W}\in\mathbb{R}^{q \times k}\) is a matrix of learnable parameters.

The embedding \({\bf M}\) is then normalized using an element-wise affine:

\[ {\bf H} = \text{LayerNorm}({\bf M}) = \frac{{\bf M}_{ij} - \mathbb{E}[{\bf M}_i]}{\sqrt{\text{Var}[{\bf M}_i] + \epsilon}}\cdot \gamma + \beta \]

The logits of the normalized embedding \({\bf H}\) are then converted to probabilities via softmax. However, a dropout layer was mistakenly applied to \({\bf H}\) before the softmax activation. While this dropout layer was intended as part of the original architecture, its placement was incorrect, as channels should not be zeroed out before the final activation of the output layer in any architecture.

\[{\bf P} = \text{Softmax}({\bf H}) \]

This framework has two distinct problems:

    1. Activation with LeakyReLU prior to normalizing embeddings and computing logits distorts the signals, washing out fine-grained community structure. This introduces artifacts in the top hierarchy layer.
    1. In addition, an erroneously added dropout layer in output layers was directly responsible for misclassified nodes.

See the example in Figure 2 below which shows the embeddings at the first and last training epoch under this framework:

Figure 2 This example illustrates the GATE embedding {\bf Z} and the output embedding \bf M before and after activation and normalization under the previous prediction framework. As shown, activation and normalization distort the community signals, causing the model to incorrectly identify three communities instead of the five clearly visible in the GATE embedding \bf Z
Figure 2 This example illustrates the GATE embedding \({\bf Z}\) and the output embedding \(\bf M\) before and after activation and normalization under the previous prediction framework. As shown, activation and normalization distort the community signals, causing the model to incorrectly identify three communities instead of the five clearly visible in the GATE embedding \(\bf Z\)

After inspecting the embeddings at each stage of this computation, it was observed that activating the embedding with LeakyReLU before converting logits distorted the signals in the data, essentially washing out fine-grained community structure. This led to artifact community signals in the top layer of the hierarchy. Instead, we seek to address this fundamental issue in the prediction model using the following revised framework:

Revised Framework For Output Layer

A new intermediate embedding {} is computed

\[ {\bf M} = {\bf Z} {\bf W}_{1} + \beta_1 \]

\[{\bf M}_{norm} = \text{LayerNorm}({\bf M}) \]

where \({\bf M}\in\mathbb{R}^{N \times q}\) is an intermediate output embedding, and \({\bf W}\in\mathbb{R}^{q \times q}\) is a matrix of learnable parameters that casts \(\bf Z\) to the same dimension. Again, \({\bf M}\) is normalized using an element-wise affine:

\[ H = \text{LeakyReLU}({\bf M}_{norm}) \]

\(\bf H\) is then converted to the output embedding using a single fully connected layer

\[ {\bf O} = {\bf H}{\bf W}_{out} + \beta_{out} \]

\[{\bf P} = \text{Softmax}({\bf O}) \]

This revised framework eliminates the activation and normalization of embeddings before computing logits, removes inappropriate layer dropout, and ensures that the output embedding effectively shifts the logits to the appropriate embedding space. These adjustments preserve fine-grained and hierarchical community signals through three key changes:

  1. Compute the intermediate embedding \({\bf M}\)
  • Generate the intermediate representation from the input embedding.
  1. Normalize before activation:
  • Perform normalization on the embeddings prior to applying the LeakyReLU activation. This ensures that the distribution remains consistent and is not distorted by the non-linear effects of LeakyReLU.
  1. Directly map to the output dimension:
  • Convert the final intermediate embedding to the output dimension using a fully connected layer, applied just before the softmax activation. This minimizes distortion and ensures proper alignment of the logits with the target space.
This example illustrates the GATE embedding {\bf Z} and the output embedding \bf M before and after activation and normalization under the revised prediction framework. As shown, the revised framework produces an overall more informative intermediate embedding and removes distortion effects
This example illustrates the GATE embedding \({\bf Z}\) and the output embedding \(\bf M\) before and after activation and normalization under the revised prediction framework. As shown, the revised framework produces an overall more informative intermediate embedding and removes distortion effects

#Both-Linear 5, 15

#SoftKMeans-Linear-5,15

#Both-Linear-BH-64

Barabási, Albert-László, and Eric Bonabeau. 2003. “Scale-Free Networks.” Scientific American 288 (5): 60–69.
Falkner, J. K. 2022. “Torch-Kmeans.” GitHub Repository. https://github.com/jokofa/torch_kmeans; GitHub.
Salehi, Amin, and Hasan Davulcu. 2019. “Graph Attention Auto-Encoders.” arXiv Preprint arXiv:1905.10715.
Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. “Graph Attention Networks.” arXiv Preprint arXiv:1710.10903.
Watts, Duncan J, and Steven H Strogatz. 1998. “Collective Dynamics of ‘Small-World’networks.” Nature 393 (6684): 440–42.